ChatGPT & Google Gemini prompt

In [1]:
import openai
from tqdm import tqdm
from causal_chains.CausalChain import util  # https://github.com/helliun/causal-chains
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display, Markdown
from dotenv import load_dotenv
import os
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import google.generativeai as genai
from google.generativeai.types import HarmCategory, HarmBlockThreshold
import pathlib
import textwrap
In [2]:
who_data = pd.read_csv("../data/corpus.csv")

Causal text mining (CTM) has been applied to various NLP tasks such as knowledge base construction, question answering, and text summarization. CTM methodologies often involve two phases: causal sequence classification and causal span detection.

  • Causal sequence classification is a binary classification task that detects whether a sequence entails causality. It requires a deep understanding of commonsense knowledge, since determining causality depends on comprehending underlying real-world principles and contexts (Gao et al.).
  • Causal span detection aims to distinguish the cause and effect arguments present in causal sequences. In addition to the capabilities above, it requires a precise understanding of complex contexts comprising multiple entities and events, in order to discern which parts of a sequence correspond to causes, which to effects, and which are noise.

Biomedical causal relations extracted from different resources, such as online journals, books, and reports, can be leveraged to form causal chains, which may result in the discovery of previously unknown relations.

CTM includes various approaches:

  • Knowledge-based systems (expert opinions): rely heavily on domain experts to define rules and patterns for identifying causal relationships in text.
  • Machine learning: Naive Bayes, Support Vector Machines (SVM), and Conditional Random Fields (CRF) have been used to classify and extract causal relationships. These models require extensive feature engineering and rely on lexical and syntactic features such as keywords (“due to”, “can cause”), part-of-speech tags, and dependency relations.
  • Deep learning techniques
    • Multiview Convolutional Neural Networks (MVC): This approach leverages multiple views of the input text to capture different aspects of the data. It can combine syntactic, semantic, and positional information to enhance causal relation extraction.
    • Recurrent Neural Networks (RNN): BiLSTM (Bidirectional Long Short-Term Memory) models capture long-range dependencies in text by processing it in both forward and backward directions. Attention mechanisms are often integrated to focus on the parts of the text that contribute to causal relationships.
    • Graph Neural Networks (GNNs): GNNs can model text as graphs, where nodes represent entities or concepts and edges represent relationships. This approach is beneficial for capturing complex causal structures.
    • Transformer Models
      • Bidirectional Encoder Representations from Transformers (BERT): BERT is pre-trained on large corpora and can be fine-tuned for specific tasks. It captures context from both directions, making it effective for understanding complex dependencies in text. Variants like BioBERT (for biomedical text) and ClinicalBERT are tailored for specific domains.
      • ELMo (Embeddings from Language Models): although built on bidirectional LSTMs rather than the Transformer architecture, ELMo generates contextualized word embeddings by considering the entire sentence, providing richer representations for identifying causal relationships.
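To make the knowledge-based end of this list concrete, here is a minimal sketch of a cue-phrase rule using the two keywords mentioned above (“due to”, “can cause”). The cue table and the function name `extract_causal_pair` are illustrative assumptions, not part of any library or of this notebook's pipeline. Returning `None` plays the role of sequence classification (no causality detected), and the returned tuple is a crude form of span detection.

```python
import re

# Hypothetical cue table: (cue phrase, whether the effect precedes the cue).
CUES = [
    ("due to", True),      # "<effect> due to <cause>"
    ("can cause", False),  # "<cause> can cause <effect>"
]

def extract_causal_pair(sentence):
    """Return (cause, effect) for the first cue found, or None if no cue matches."""
    text = sentence.strip().rstrip(".")
    for cue, effect_first in CUES:
        match = re.search(rf"\b{re.escape(cue)}\b", text, flags=re.IGNORECASE)
        if match is None:
            continue
        left = text[:match.start()].strip()
        right = text[match.end():].strip()
        if not left or not right:
            continue  # a cue with nothing on one side is not a usable pair
        return (right, left) if effect_first else (left, right)
    return None

print(extract_causal_pair("Delays in care-seeking behaviour due to fear of stigma."))
print(extract_causal_pair("The outbreak continues to expand geographically."))
```

A rule like this only sees explicit, single-sentence cues, which is the limitation the learned approaches in the list try to address.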

LLMs have demonstrated impressive performance across numerous NLP tasks with zero-shot or few-shot in-context learning, without requiring the supervised training that traditional encoder-based models need.

ChatGPT often demonstrates competitive results in few-shot settings, even on financial domain-specific datasets and Japanese datasets, although a fully trained encoder-based model still outperforms it. ChatGPT is therefore a good starting point for various datasets, especially when training data are unavailable, but not a good causal text miner when training data are readily available.

ChatGPT's performance is not influenced by training data size, which is why it serves well when data are limited. In contrast, encoder models depend heavily on data size.

ChatGPT struggles with complex causality types, especially intra-/inter-sentential and implicit causality.

Sample sentence: The sudden appearance of unlinked cases of mpox in South Africa without a history of international travel, the high HIV prevalence among confirmed cases, and the high case fatality ratio suggest that community transmission is underway, and the cases detected to date represent a small proportion of all mpox cases that might be occurring in the community; it is unknown how long the virus may have been circulating. This may in part be due to the lack of early clinical recognition of an infection with which South Africa previously gained little experience during the ongoing global outbreak, potential pauci-symptomatic manifestation of the disease, or delays in care-seeking behaviour due to limited access to care or fear of stigma.

Expected results:

  • Cause: lack of early clinical recognition of an infection -> Effect: community transmission of mpox
  • Cause: pauci-symptomatic manifestation of the disease -> Effect: lack of early clinical recognition of an infection
  • Cause: delays in care-seeking behaviour -> Effect: lack of early clinical recognition of an infection
  • Cause: limited access to care -> Effect: delays in care-seeking behaviour
  • Cause: fear of stigma -> Effect: delays in care-seeking behaviour
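The expected pairs above link into chains once they are treated as directed edges. The sketch below is one way to trace them, using a plain dictionary as the graph. The helper names `build_graph` and `chains_from` are hypothetical and are not taken from the `causal_chains` package.

```python
# The expected cause -> effect pairs listed above, as directed edges.
pairs = [
    ("lack of early clinical recognition of an infection", "community transmission of mpox"),
    ("pauci-symptomatic manifestation of the disease", "lack of early clinical recognition of an infection"),
    ("delays in care-seeking behaviour", "lack of early clinical recognition of an infection"),
    ("limited access to care", "delays in care-seeking behaviour"),
    ("fear of stigma", "delays in care-seeking behaviour"),
]

def build_graph(pairs):
    """Map each cause to the list of effects it leads to."""
    graph = {}
    for cause, effect in pairs:
        graph.setdefault(cause, []).append(effect)
    return graph

def chains_from(graph, start, path=()):
    """Return every path from `start` to a node with no further effects."""
    path = path + (start,)
    # Skip effects already on the path so a cyclic input cannot recurse forever.
    next_effects = [e for e in graph.get(start, []) if e not in path]
    if not next_effects:
        return [list(path)]
    chains = []
    for effect in next_effects:
        chains.extend(chains_from(graph, effect, path))
    return chains

graph = build_graph(pairs)
all_effects = {effect for _, effect in pairs}
root_causes = [cause for cause in graph if cause not in all_effects]

for root in root_causes:
    for chain in chains_from(graph, root):
        print(" -> ".join(chain))
```

Every chain here ends at "community transmission of mpox", which matches the final outcome the prompt later asks the models to label as "Disease transmission".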
In [3]:
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
gemini_api_key = os.getenv("GEMINI_API_KEY")

# Initialize the Gemini API client
genai.configure(api_key=gemini_api_key)
safety_filters = {
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE
    # ... add other categories if you need them and set them to BLOCK_NONE
}

class CausalChain:

    one_shot_example = """
    Example of disease transmission
    Text: The sudden appearance of unlinked cases of mpox in South Africa without a history of international travel, the high HIV prevalence among confirmed cases, and the high case fatality ratio suggest that community transmission is underway, and the cases detected to date represent a small proportion of all mpox cases that might be occurring in the community; it is unknown how long the virus may have been circulating. This may in part be due to the lack of early clinical recognition of an infection with which South Africa previously gained little experience during the ongoing global outbreak, potential pauci-symptomatic manifestation of the disease, or delays in care-seeking behaviour due to limited access to care or fear of stigma.
    Question: Which drivers cause the emergence or transmission of an infectious disease outbreak in the region?
    Answer: 
    Cause: limited access to care (Public Health Systems) -> Effect: delays in care-seeking behaviour (Social & Demographic Change)
    Cause: fear of stigma (Social & Demographic Change) -> Effect: delays in care-seeking behaviour (Social & Demographic Change)
    Cause: delays in care-seeking behaviour (Social & Demographic Change) -> Effect: lack of early clinical recognition of an infection (Public Health Systems)
    Cause: pauci-symptomatic manifestation of the disease (Disease characteristics) -> Effect: lack of early clinical recognition of an infection (Public Health Systems)
    Cause: lack of early clinical recognition of an infection (Public Health Systems) -> Effect: community transmission of mpox (Disease transmission)
    """
    
    two_shot_example = """
    Example of disease emergence
    Text: The risk of dengue is similar across regions, countries, and within countries. Factors associated with an increasing risk of dengue epidemics and spread to new countries include: early start and longer duration of dengue transmission seasons in endemic areas; changing distribution and increasing abundance of the vectors (Aedes aegypti and Aedes albopictus); consequences of climate change and periodic weather phenomena (El Nino and La Nina events) leading to heavy precipitation, humidity, and rising temperatures favouring vector reproduction and virus transmission;
    Question: Which drivers cause the emergence or transmission of an infectious disease outbreak in the region?
    Answer: 
    Cause: consequences of climate change and periodic weather phenomena (Globalization & Environmental Change) -> Effect: vector reproduction and virus transmission (Disease characteristics)
    Cause: vector reproduction and virus transmission (Disease characteristics) -> Effect: changing distribution and increasing abundance of the vectors (Disease characteristics)
    Cause: changing distribution and increasing abundance of the vectors (Disease characteristics) -> Effect: early start and longer duration of dengue transmission seasons in endemic areas (Disease characteristics)
    Cause: early start and longer duration of dengue transmission seasons in endemic areas (Disease characteristics) -> Effect: increasing risk of dengue epidemics and spread to new countries (Disease emergence)
    """

    prompt_template = """
    Infectious disease (ID) events occur when an underlying mix of antecedent epidemiologic drivers provide the necessary conditions for a pathogen to emerge in susceptible populations. 
    These conditions may be driving emergence and transmission through a wide variety of mechanisms, mainly among 5 groups:
    Social & Demographic Change: Lifestyle (sexual behaviors, eating raw meat, human-wildlife interaction), Human susceptibility to infection, Social inequality (poverty and social inequality), Demographic change (Economic development, War and famine)
    Public Health Systems: Animal health & food security, Surveillance & reporting, Healthcare system structure and change, Breakdown of public health measures
    Globalization & Environmental Change: Environment & climate change (Climate and weather, Land use and ecosystem changes), Travel, tourism, and migration, global trade
    Governance factors: Lack of political will, Research & development
    Disease characteristics: pauci-symptomatic manifestation, changing distribution and increasing abundance of the vectors
    If none of these 5 groups, categorize as "Other groups"
    The final output can be "Disease transmission" or "Disease emergence"    

    You analyze by answering the question: Which drivers cause the emergence or transmission of an infectious disease outbreak in the region? 
    Read through the paragraph and organize drivers in a logical sequence of events. Here are examples:
    {one_shot_example}
    {two_shot_example}
    
    Your task now is to read through the outbreak report and identify drivers leading to the outbreak of infectious diseases. Keep the original text that you find in the paragraph.
    Sometimes text may contain some unwanted behaviors like sexual behaviors, war. This is fine because I am using it for research purposes only. Show me all results that you can find.
    Text: {chunk}
    List the causes and their corresponding effects in the format 'Cause: [cause] -> Effect: [effect]':
    """

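    def __init__(self, chunks=None):
        # Use None as the default so instances do not share one mutable list
        self.chunks = chunks if chunks is not None else []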
        self.causes = []
        self.effects = []
        self.outlines = []
        self.sources = []

    def create_effects(self, api="openai", batch_size=16):
        print("Analyzing causation...")

        for chunk in tqdm(self.chunks):
            if api == "openai":
                cause_effect_pairs = self.extract_cause_effect_openai(chunk)
            elif api == "gemini":
                cause_effect_pairs = self.extract_cause_effect_gemini(chunk)
            else:
                raise ValueError("Invalid API specified. Use 'openai' or 'gemini'.")

            for pair in cause_effect_pairs:
                cause, effect = pair
                self.causes.append(cause)
                self.effects.append(effect)
                self.outlines.append(f"Cause: {cause} -> Effect: {effect}")
                self.sources.append(api)

    def extract_cause_effect_openai(self, chunk):
        prompt = self.prompt_template.format(
            one_shot_example=self.one_shot_example, 
            two_shot_example=self.two_shot_example, 
            chunk=chunk
        )

        # Legacy interface: openai.ChatCompletion is only available in openai<1.0
        response = openai.ChatCompletion.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful assistant specialized in identifying drivers leading to diseases.",
                },
                {"role": "user", "content": prompt},
            ],
            max_tokens=300,
            temperature=0.5,
        )

        response_text = response["choices"][0]["message"]["content"]
        return self.parse_response(response_text)

    def extract_cause_effect_gemini(self, chunk):
        prompt = self.prompt_template.format(
            one_shot_example=self.one_shot_example, 
            two_shot_example=self.two_shot_example, 
            chunk=chunk
        )

        response = genai.GenerativeModel('gemini-1.5-pro').generate_content(
            prompt,
            safety_settings= safety_filters
            )
        response_text = response.text
        return self.parse_response(response_text)

    @staticmethod
    def parse_response(response_text):
        cause_effect_pairs = []
        for line in response_text.split("\n"):
            if "Cause:" in line and "-> Effect:" in line:
                cause = line.split("Cause:")[1].split("-> Effect:")[0].strip()
                effect = line.split("-> Effect:")[1].strip()
                cause_effect_pairs.append((cause, effect))
        return cause_effect_pairs

def create_causes_effects_dataframe(causes, effects, sources):
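    def split_cause_effect(value):
        # Split "text (Group)" into (text, group) at the last opening parenthesis.
        # This assumes the value ends with ")": markdown emphasis in the model
        # output (e.g. a trailing "**") leaves stray characters in the group,
        # and a parenthetical that is not a group label is mistaken for one.
        if "(" in value and ")" in value:
            main_text, group = value.rsplit("(", 1)
            main_text = main_text.strip()
            group = group[:-1].strip()  # Remove the closing parenthesis
            return main_text, group
        return value, "Unknown"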

    cause_texts, cause_groups = zip(*[split_cause_effect(cause) for cause in causes])
    effect_texts, effect_groups = zip(*[split_cause_effect(effect) for effect in effects])

    data = {
        "Cause": cause_texts,
        "Cause_group": cause_groups,
        "Effect": effect_texts,
        "Effect_group": effect_groups,
        "Source": sources
    }
    
    df = pd.DataFrame(data)
    return df
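The parser only keeps lines that contain both `Cause:` and `-> Effect:`, so anything else the model writes is silently dropped. The standalone sketch below mirrors that line-matching logic on a hand-written response. Both `parse_pairs` and the sample text are illustrative stand-ins, not real model output.

```python
def parse_pairs(response_text):
    """Mirror of the line format used above: 'Cause: ... -> Effect: ...'."""
    pairs = []
    for line in response_text.splitlines():
        if "Cause:" in line and "-> Effect:" in line:
            cause = line.split("Cause:")[1].split("-> Effect:")[0].strip()
            effect = line.split("-> Effect:")[1].strip()
            pairs.append((cause, effect))
    return pairs

# Hand-written sample; a real response may include extra commentary like this.
sample_response = """Here are the drivers I found:
Cause: limited access to care (Public Health Systems) -> Effect: delays in care-seeking behaviour (Social & Demographic Change)
Cause: fear of stigma (Social & Demographic Change) -> Effect: delays in care-seeking behaviour (Social & Demographic Change)
No further drivers were identified."""

for cause, effect in parse_pairs(sample_response):
    print(cause, "=>", effect)
```

The group label stays attached to each string at this stage, and it is separated later when the DataFrame is built.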

Example of text to ask LLMs

In the Democratic Republic of the Congo, most reported cases in known endemic provinces continue to be among children under 15 years of age, especially in young children. Infants and children under five years of age are at highest risk of severe disease and death, particularly where prompt optimal case management is limited or unavailable. The number of cases reported weekly remains consistently high while the outbreak continues to expand geographically. High test positivity among tested cases in most provinces also suggests that undetected transmission is likely ongoing in the community. Transmission of mpox due to clade I MPXV via sexual contact in key populations was first identified in the Democratic Republic of the Congo in 2023. In South Kivu province, mpox transmission is sustained through human-to-human contact (sexual and non-sexual)

In [128]:
text = who_data["Text"][9]
chunks = util.create_chunks(text)
cc = CausalChain(chunks)
In [130]:
cc.create_effects(api="openai")
Analyzing causation...
100%|██████████████████████████████████████████| 12/12 [00:37<00:00,  3.13s/it]
In [129]:
cc.create_effects(api="gemini")
Analyzing causation...
100%|██████████████████████████████████████████| 12/12 [01:18<00:00,  6.57s/it]
In [135]:
df = create_causes_effects_dataframe(cc.causes, cc.effects, cc.sources)
In [136]:
display(df[df['Source'] == 'gemini'])
| | Cause | Cause_group | Effect | Effect_group | Source |
|---|---|---|---|---|---|
| 0 | Limited availability of prompt optimal case ma... | Public Health Systems | Infants and children under five years of age a... | Social & Demographic Change)* | gemini |
| 1 | ** human-to-human contact | sexual and non-sexual) * | ** Transmission of mpox | Disease Transmission | gemini |
| 2 | ** sexual contact in key populations ** | Unknown | ** Transmission of mpox due to clade I MPXV | Disease Transmission | gemini |
| 3 | ** undetected transmission in the community ** | Unknown | ** High test positivity among tested cases | Disease Transmission | gemini |
| 4 | **lack of timely access to diagnostics in many... | Public Health Systems | **incomplete epidemiological investigations** | Public Health Systems | gemini |
| 5 | **incomplete epidemiological investigations** | Public Health Systems | **challenges in contact tracing** | Public Health Systems | gemini |
| 6 | **challenges in contact tracing** | Public Health Systems | **the outbreak in South Kivu is already spread... | Disease transmission | gemini |
| 7 | eradication of smallpox | Public Health Systems | immunity gap | Social & Demographic Change | gemini |
| 8 | MPXV continues to move into the immunity gap | Disease characteristics | human-to-human transmission | Disease transmission | gemini |
| 9 | logistical and resource challenges | Public Health Systems | limited Surveillance and investigating alerts | Public Health Systems | gemini |
| 10 | limited laboratory capacities | Public Health Systems | limited Surveillance and investigating alerts | Public Health Systems | gemini |
| 11 | lack of effective dissemination to date of hea... | Public Health Systems | low awareness of the risks associated with mpox | Social & Demographic Change)* | gemini |
| 12 | low awareness of the risks associated with mpox | Social & Demographic Change | exposes them to further risk | Disease Transmission)* | gemini |
| 13 | ** River boat travel | Globalization & Environmental Change) * | ** Outbreaks in Kinshasa | Disease transmission | gemini |
| 14 | ** Co-infections with HIV and other sexually t... | Social & Demographic Change) * | ** Increased severity of MPXV | Disease Transmission | gemini |
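Several Gemini rows carry stray `*` characters and trailing `)` in the group columns, and one row treats a parenthetical from the text as a group. A possible cleanup, sketched below, strips markdown emphasis first and accepts a trailing parenthetical as a group only if it matches the taxonomy named in the prompt. The names `KNOWN_GROUPS`, `clean_label`, and `split_text_and_group` are assumptions introduced here, and the raw strings in the usage lines are guesses at what the model returned, reconstructed from the table.

```python
import re

# Group and outcome labels taken from the prompt template, keyed in lowercase
# so that "Disease Transmission" and "Disease transmission" map to one label.
KNOWN_GROUPS = {
    name.lower(): name
    for name in [
        "Social & Demographic Change",
        "Public Health Systems",
        "Globalization & Environmental Change",
        "Governance factors",
        "Disease characteristics",
        "Other groups",
        "Disease transmission",
        "Disease emergence",
    ]
}

def clean_label(value):
    """Remove markdown emphasis markers and surrounding whitespace."""
    return value.replace("*", "").strip()

def split_text_and_group(value):
    """Split 'text (Group)' into (text, group); group is 'Unknown' if absent."""
    cleaned = clean_label(value)
    match = re.fullmatch(r"(?P<text>.+?)\s*\((?P<group>[^()]+)\)", cleaned)
    if match is None:
        return cleaned, "Unknown"
    group = KNOWN_GROUPS.get(match.group("group").strip().lower())
    if group is None:
        # The parenthetical belongs to the text, e.g. "(sexual and non-sexual)".
        return cleaned, "Unknown"
    return match.group("text").strip(), group

# Reconstructed (hypothetical) raw values behind rows 13 and 1 above.
print(split_text_and_group("** River boat travel (Globalization & Environmental Change) **"))
print(split_text_and_group("** human-to-human contact (sexual and non-sexual) **"))
```

Restricting groups to a known set trades flexibility for consistency: a label the model invents would fall back to "Unknown" rather than appear as a new category.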
In [137]:
display(df[df['Source'] == 'openai'])
| | Cause | Cause_group | Effect | Effect_group | Source |
|---|---|---|---|---|---|
| 15 | limited or unavailable prompt optimal case man... | Public Health Systems | high risk of severe disease and death in infan... | Social & Demographic Change | openai |
| 16 | high risk of severe disease and death in infan... | Social & Demographic Change | consistently high number of cases reported weekly | Disease transmission | openai |
| 17 | consistently high number of cases reported weekly | Disease transmission | outbreak continues to expand geographically | Disease transmission | openai |
| 18 | high test positivity among tested cases in mos... | Public Health Systems | undetected transmission likely ongoing in the ... | Disease transmission | openai |
| 19 | sexual contact in key populations | Social & Demographic Change | transmission of mpox due to clade I MPXV | Disease transmission | openai |
| 20 | human-to-human contact (sexual and non-sexual)... | Social & Demographic Change | sustained mpox transmission | Disease transmission | openai |
| 21 | sexual contact | Social & Demographic Change | faster transmission | Disease transmission | openai |
| 22 | immune suppression, especially among those wit... | Social & Demographic Change | risk factors for severe disease and death amon... | Disease transmission | openai |
| 23 | prevalence of HIV in the general adult populat... | Social & Demographic Change | higher risk of severe disease and death among ... | Disease transmission | openai |
| 24 | sustained human-to-human sexual transmission o... | Social & Demographic Change | additional public health impact | Public Health Systems | openai |
| 25 | higher HIV prevalence in the eastern provinces | Social & Demographic Change | higher risk of severe disease and death among ... | Disease transmission | openai |
| 26 | lack of timely access to diagnostics in many a... | Public Health Systems | incomplete epidemiological investigations | Public Health Systems | openai |
| 27 | incomplete epidemiological investigations | Public Health Systems | challenges in contact tracing | Public Health Systems | openai |
| 28 | challenges in contact tracing | Public Health Systems | outbreak spreading into the wider community | Disease transmission | openai |
| 29 | outbreak spreading into the wider community | Disease transmission | occurrence of cases among a broad range of occ... | Disease transmission | openai |
| 30 | new features of human-to-human transmission | Disease characteristics | further rapid expansion of the outbreak | Disease transmission | openai |
| 31 | further rapid expansion of the outbreak | Disease transmission | geographic expansion to new areas, such as Kin... | Disease transmission | openai |
| 32 | geographic expansion to new areas, such as Kin... | Disease transmission | increase in suspected cases reported | Disease transmission | openai |
| 33 | travel from endemic areas | Globalization & Environmental Change | cases in newly affected provinces | Disease transmission | openai |
| 34 | secondary or sustained human-to-human transmis... | Disease transmission | cases in newly affected provinces | Disease transmission | openai |
| 35 | immunity gap left following eradication of sma... | Social & Demographic Change | MPXV continues to move | Disease transmission | openai |
| 36 | logistical and resource challenges | Public Health Systems | limited surveillance and investigating alerts | Public Health Systems | openai |
| 37 | limited laboratory capacities | Public Health Systems | limited surveillance and investigating alerts | Public Health Systems | openai |
| 38 | reliance on the support of WHO and other partners | Public Health Systems | response capacities to mpox in the country | Public Health Systems | openai |
| 39 | ongoing immunogenicity and safety studies of M... | Governance factors | national immunization technical advisory group... | Governance factors | openai |
| 40 | national immunization technical advisory group... | Governance factors | use of mpox vaccines in the country for person... | Public Health Systems | openai |
| 41 | recommendations for preferred use of LC16 in c... | Governance factors | immunization strategy for different age groups | Public Health Systems | openai |
| 42 | intention to vaccinate persons at risk | Governance factors | use of LC16 and MVA-BN vaccinia-based mpox vac... | Governance factors | openai |
| 43 | request for authorization of temporary use of ... | Governance factors | regulatory review by ACOREP | Governance factors | openai |
| 44 | regulatory review by ACOREP | Governance factors | temporary use of these vaccines | Governance factors | openai |
| 45 | planning of further clinical efficacy and safe... | Governance factors | further clinical efficacy and safety studies f... | Governance factors | openai |
| 46 | developing emergency response immunization str... | Public Health Systems | persons and areas at risk are targeted | Public Health Systems | openai |
| 47 | extensive consultation internally, with WHO an... | Governance factors | development of emergency response immunization... | Public Health Systems | openai |
| 48 | clinical efficacy studies of tecovirimat | Governance factors | potential future access to tecovirimat | Public Health Systems | openai |
| 49 | study expected to complete recruitment in 2024 | Governance factors | delayed access to tecovirimat until study comp... | Public Health Systems | openai |
| 50 | low awareness of the risks associated with mpo... | Public Health Systems | increased risk of disease transmission in the ... | Disease transmission | openai |
| 51 | lack of effective dissemination of health mess... | Public Health Systems | increased risk for key populations such as sex... | Social & Demographic Change | openai |
| 52 | increased risk for key populations such as sex... | Social & Demographic Change | further exposure to mpox | Disease transmission | openai |
| 53 | co-infections with HIV and other sexually tran... | Social & Demographic Change | outbreaks in newly reported areas in southern ... | Disease transmission | openai |
| 54 | river boat travel | Globalization & Environmental Change | outbreaks in the city of Kinshasa | Disease transmission | openai |
| 55 | under detection or underreporting of transmission | Public Health Systems | significant under detection or underreporting ... | Public Health Systems | openai |
| 56 | resources to respond over such a wide geograph... | Public Health Systems | insufficient response to the outbreak | Public Health Systems | openai |
| 57 | resource mobilization is slow | Public Health Systems | insufficient response to the outbreak | Public Health Systems | openai |
| 58 | public awareness remains limited | Social & Demographic Change | insufficient response to the outbreak | Public Health Systems | openai |
| 59 | resources are scarce | Public Health Systems | insufficient response to the outbreak | Public Health Systems | openai |
| 60 | technical and financial support is needed | Public Health Systems | insufficient response to the outbreak | Public Health Systems | openai |
| 61 | insufficient response to the outbreak | Public Health Systems | continuation of disease transmission | Disease transmission | openai |
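With both result sets in hand, a natural next question is which cause-effect pairs the two models agree on. The sketch below uses exact matching after light normalization, a deliberately crude stand-in that will miss paraphrases. An embedding-based similarity would be softer, but it is not attempted here. The rows are a small hand-copied subset of the untruncated entries in the two tables, and `normalize` and `normalized_pairs` are hypothetical helper names.

```python
import re

def normalize(text):
    """Drop markdown asterisks, collapse whitespace, and lowercase."""
    return re.sub(r"\s+", " ", text.replace("*", "")).strip().lower()

def normalized_pairs(rows):
    """Turn (cause, effect) rows into a set of normalized pairs."""
    return {(normalize(cause), normalize(effect)) for cause, effect in rows}

# Hand-copied subset of untruncated rows from the Gemini table.
gemini_rows = [
    ("logistical and resource challenges", "limited Surveillance and investigating alerts"),
    ("limited laboratory capacities", "limited Surveillance and investigating alerts"),
    ("**incomplete epidemiological investigations**", "**challenges in contact tracing**"),
    ("eradication of smallpox", "immunity gap"),
]

# Hand-copied subset of untruncated rows from the OpenAI table.
openai_rows = [
    ("logistical and resource challenges", "limited surveillance and investigating alerts"),
    ("limited laboratory capacities", "limited surveillance and investigating alerts"),
    ("incomplete epidemiological investigations", "challenges in contact tracing"),
    ("sexual contact", "faster transmission"),
]

gemini_pairs = normalized_pairs(gemini_rows)
openai_pairs = normalized_pairs(openai_rows)

shared = gemini_pairs & openai_pairs
gemini_only = gemini_pairs - openai_pairs
openai_only = openai_pairs - gemini_pairs

print("Shared:", len(shared))
print("Gemini only:", sorted(gemini_only))
print("OpenAI only:", sorted(openai_only))
```

On the full DataFrame the same idea would apply to the `Cause` and `Effect` columns grouped by `Source`, though truncated display strings should not be used as input.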